Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Azure IMDS Url in InstanceMetadataService initialization #4600

Merged
merged 1 commit into from
Jan 13, 2022
Merged

Conversation

whites11
Copy link

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

In CA version 1.21, on Azure, it was possible to omit the Subscription ID completely.
On startup, cluster autoscaler used to detect it using Azure Instance Metadata Service (IMDS).

It used to do so using a service in legacy-cloud-providers.

The initialization code for such service is as follows:

metadataService, err := providerazure.NewInstanceMetadataService(metadataURL)

In 1.21 this worked just fine.

In 1.22 the meaning of the metadataURL parameter was changed in the service code.
What used to be a full URL including path in the version used in 1.21 http://169.254.169.254/metadata/instance was changed to a root path in the version used in 1.22.

This change was not picked up in CA code and this causes CA to panic on startup when the Subscription ID is not provided:

F0112 08:57:17.405106       1 azure_cloud_provider.go:167] Failed to create Azure Manager: failure of getting instance metadata with response "404 Not Found"

Because the computed IMDS URL contains the path twice:

http://169.254.169.254/metadata/instance/metadata/instance

obviously ending up with a 404 from the IMDS endpoint.

This easy PR fixes just that by passing the correct value to the NewInstanceMetadataService function.

Which issue(s) this PR fixes:

None that I could find.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed a panic in Cluster Autoscaler on Azure when no Subscription ID was provided by the user.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 12, 2022
@k8s-ci-robot
Copy link
Contributor

Welcome @whites11!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 12, 2022
@whites11
Copy link
Author

whites11 commented Jan 12, 2022

The failing test sounds like a flake to me, or at least I doubt my change introduced it.

@marwanad
Copy link
Member

Seems like the storage account team disabled an endpoint that was previously used for testing. You can fix the unit test by cherry-picking: #4594.

@nilo19 can you see if there's value in that unit test given the above and remove it from all supported branches if not?

@marwanad
Copy link
Member

This PR looks good for me.

@whites11 I think we have removed IMDS auth in latest CA master - do you see value in bringing that back?

@whites11
Copy link
Author

This PR looks good for me.

@whites11 I think we have removed IMDS auth in latest CA master - do you see value in bringing that back?

I totally do. I have a pr in the works to add it back.

@nilo19
Copy link
Member

nilo19 commented Jan 13, 2022

@whites11 let's disable the flaky test before completely fixing it.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 13, 2022
@whites11
Copy link
Author

@whites11 let's disable the flaky test before completely fixing it.

done

Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer, whites11

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 13, 2022
@k8s-ci-robot k8s-ci-robot merged commit 1c2491d into kubernetes:cluster-autoscaler-release-1.22 Jan 13, 2022
@whites11 whites11 deleted the azure-metadata-fix branch January 13, 2022 10:09
@whites11
Copy link
Author

Thanks for the fix.

/lgtm /approve

Thanks for quick review&merge.
How do I go about asking for a 1.22.3 release?

@feiskyer
Copy link
Member

+@MaciekPytel what's our plans for next patch release? is it possible to cut that by request?

@MaciekPytel
Copy link
Contributor

Usually we cut them when requested via sig. Probably the best way to request is by adding an item to sig-meeting agenda.

How critical is this fix? We just did patch releases in late December, doing another one right now would be a bit soon. But if it's critical than we can bring it up in today sig meeting and set the cut date for next week (to give other providers some time to get their patches in).

@whites11
Copy link
Author

Usually we cut them when requested via sig. Probably the best way to request is by adding an item to sig-meeting agenda.

How critical is this fix? We just did patch releases in late December, doing another one right now would be a bit soon. But if it's critical than we can bring it up in today sig meeting and set the cut date for next week (to give other providers some time to get their patches in).

Hey @MaciekPytel thanks a lot for your answer.
I don't know how you define "critical".
There is a use case (not sure how common it is, but it is the use case we have at Giant Swarm) that is broken (cluster autoscaler crash looping).
There are workarounds, sure, but from our point of view this is critical.

Will add an entry to the agenda and see what happens.

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants